
[Intel Mkl] Parallel BiasAddGrad op with eigen intra thread pool #26426

Merged

Conversation

@wenxizhu (Contributor) commented Mar 7, 2019

The BiasAddGrad op runs on a single thread, which badly hurts the training performance of the models that use it.

We provide an optimized parallel implementation of the BiasAddGrad op.

The following are the time costs of the original and the optimized BiasAddGrad op for several models, all measured on a single socket of an SKX-8180:
YoloV2 (batch size 16) : 263.76ms -> 12.06ms
Trans-LT (batch size 1024) : 68.61ms -> 3.09ms
Inception-ResV2 (batch size 64) : 334.96ms -> 73.18ms
NCF (batch size 1024) : 1.34ms -> 0.243ms
NCF-mlperf (batch size 1024) : 0.867ms -> 0.341ms
MaskRCNN (batch size 1) : 370.57ms -> 15.23ms
DCGAN (batch size 32) : 10.46ms -> 0.512ms
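
For context, here is a minimal sketch of the sharding idea named in the title: split the reduction across the Eigen intra-op thread pool via TensorFlow's work sharder. This sketch shards the channel dimension so each shard writes a disjoint slice of the output; the function name, parameters, and sharding strategy are illustrative assumptions, not the code in this PR's diff.

    // Hedged sketch: parallel NHWC bias gradient using the intra-op thread pool.
    #include "tensorflow/core/framework/op_kernel.h"
    #include "tensorflow/core/util/work_sharder.h"

    namespace tensorflow {

    // out_backprop has logical shape (outer_dim, channel_dim); bias_grad has
    // length channel_dim.
    template <typename T>
    void ParallelBiasGradNHWC(OpKernelContext* context, const T* out_backprop,
                              int64 outer_dim, int64 channel_dim, T* bias_grad) {
      auto* worker_threads = context->device()->tensorflow_cpu_worker_threads();
      // Rough cost per channel: one load and one add for every outer element.
      const int64 cost_per_channel = 2 * outer_dim;
      Shard(worker_threads->num_threads, worker_threads->workers, channel_dim,
            cost_per_channel, [&](int64 start, int64 limit) {
              // Each shard owns channels [start, limit), so writes never race.
              for (int64 c = start; c < limit; ++c) {
                T sum = T(0);
                for (int64 i = 0; i < outer_dim; ++i) {
                  sum += out_backprop[i * channel_dim + c];
                }
                bias_grad[c] = sum;
              }
            });
    }

    }  // namespace tensorflow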

Change-Id: I3a8da878eea67a4903a3b68302c6c86c3e536025
@rthadur rthadur requested a review from ezhulenev March 7, 2019 20:59
@rthadur rthadur self-assigned this Mar 7, 2019

@ezhulenev (Member) left a comment:

EDIT: I prepared this benchmark myself; it should be on GitHub soon.

Original:
Could you also add bias_op_test.cc with micro benchmarks? You can use https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/sparse_xent_op_test.cc as an example. I will be able to use them to compare improvements with the clang toolchain.
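
For illustration, here is a hedged sketch of what such a micro benchmark could look like, modeled on other kernel test files; the helper name and the chosen shape are assumptions, not the contents of the bias_op_test.cc that was eventually added.

    // Sketch of a BiasAddGrad micro benchmark in the style of other kernel tests.
    #include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/tensor.h"
    #include "tensorflow/core/graph/node_builder.h"
    #include "tensorflow/core/graph/testlib.h"
    #include "tensorflow/core/platform/test_benchmark.h"

    namespace tensorflow {

    // Builds a tiny graph with a single BiasAddGrad node over random NHWC data.
    static Graph* BiasAddGradGraph(int n, int h, int w, int c) {
      Graph* g = new Graph(OpRegistry::Global());
      Tensor out_backprop(DT_FLOAT, TensorShape({n, h, w, c}));
      out_backprop.flat<float>().setRandom();
      Node* node;
      TF_CHECK_OK(NodeBuilder(g->NewName("bias_add_grad"), "BiasAddGrad")
                      .Input(test::graph::Constant(g, out_backprop))
                      .Attr("data_format", "NHWC")
                      .Finalize(g, &node));
      return g;
    }

    static void BM_BiasAddGradNHWC_32_32_32_128_cpu(int iters) {
      testing::ItemsProcessed(static_cast<int64>(iters) * 32 * 32 * 32 * 128);
      test::Benchmark("cpu", BiasAddGradGraph(32, 32, 32, 128)).Run(iters);
    }
    BENCHMARK(BM_BiasAddGradNHWC_32_32_32_128_cpu);

    }  // namespace tensorflow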

#include "tensorflow/core/framework/bounds_check.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/util/tensor_format.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "tensorflow/core/util/work_sharder.h"

Member:
Could you please group all tensorflow includes together.

Contributor Author:
Done.

}
}

private:
TensorFormat data_format_;

// Modified for performance tune :: new funcs

Member:
I think you can strip all the "Modified for performance" comments; they are not really helpful for the reader.

Contributor Author:
Removed.

inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {
//#pragma simd
for (int64 i = 0; i < length; i++) {
X[i] = X[i] + Y[i];

Member:
Do you know if this pragma works with clang? This loop can be expressed as a simple Eigen expression, something like:

    using X = Eigen::Map<Eigen::Array<T, Eigen::Dynamic, 1>>;
    using Y = Eigen::Map<const Eigen::Array<T, Eigen::Dynamic, 1>>;
    X(x, length) += Y(y, length);

I probably messed up some constructor parameters, but a similar idea is used in https://bitbucket.org/eigen/eigen/src/9c300336de9a096c7c4c3b230a6ae9f49fa17aea/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h#lines-494:506

This will be vectorized/unrolled.
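
For reference, a self-contained plain-Eigen sketch of the Map idiom being suggested here (and of the companion .sum() reduction suggested a few comments below); the function and variable names are illustrative only.

    // Hedged illustration of the Eigen::Map idiom, outside TensorFlow.
    #include <cstdint>
    #include <Eigen/Core>

    // x[0:length-1] += y[0:length-1]; Eigen vectorizes/unrolls the expression.
    void InplaceVecAdd(float* x, const float* y, std::int64_t length) {
      Eigen::Map<Eigen::Array<float, Eigen::Dynamic, 1>> xm(x, length);
      Eigen::Map<const Eigen::Array<float, Eigen::Dynamic, 1>> ym(y, length);
      xm += ym;
    }

    // Returns sum(x[0:length-1]) as a single vectorized reduction.
    float VecSumReduce(const float* x, std::int64_t length) {
      return Eigen::Map<const Eigen::Array<float, Eigen::Dynamic, 1>>(x, length).sum();
    }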

Contributor Author:
OK, replaced them with Eigen APIs. I also figured there is no need to keep the InplaceVecAdd() and VecSumReduce() functions, so I inlined them into the lambda function.


// Modified for performance tune :: new funcs
// Apply X[0:length-1] = X[0:length-1] + Y[0:length-1];
inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {

Member:
Function arg names must be 'x' and 'y'.

Contributor Author:
This function has been removed, so that's not a problem now.

}
}
// Return sum(X[0:length-1])
inline T VecSumReduce(const T* X, const int64 length) {

Member:
a. method arg name
b. it could also be return Eigen::Map<...>(x, length).sum();

I didn't find any other simd pragmas in the TensorFlow codebase, so my assumption is that it's not supported by all compilers.

Contributor Author:
This function has been removed, so that's not a problem now.

.template cast<typename AccumulatorType<T>::type>()
.reshape(two_dims)
.sum(reduction_axis)
.template cast<T>();

Member:
I think all the performance problems with this grad were fixed in https://bitbucket.org/eigen/eigen/commits/948b43eaf2c5c35c988308d120a8a24781c63de6; which version of TF did you use in the benchmarks?

Contributor Author:
According to my measurements, I believe my implementation is still faster than the commit you mentioned, so could I keep using my implementation?

@ezhulenev (Member):
I did run the benchmarks from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc and already got pretty good results. Feel free to add more benchmarks; you probably want to test NCHW as well.

name                                              old time/op             new time/op             delta
BM_BiasAddGradNHWC_32_32_32_128_cpu               4.84ms ± 1%             2.30ms ± 1%  -52.46%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_256_cpu               9.66ms ± 2%             4.15ms ± 1%  -57.08%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_512_cpu               80.1ms ±13%             18.4ms ± 4%  -76.97%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_1024_cpu               242ms ± 2%               41ms ±17%  -83.17%          (p=0.000 n=9+9)
BM_BiasAddGradNHWC_32_64_64_128_cpu               58.3ms ± 3%             18.5ms ± 5%  -68.16%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_256_cpu                177ms ± 5%               44ms ±12%  -75.48%         (p=0.000 n=9+10)
BM_BiasAddGradNHWC_32_64_64_512_cpu                452ms ± 6%               92ms ± 8%  -79.75%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_1024_cpu               1.19s ± 5%              0.23s ± 7%  -80.93%        (p=0.000 n=10+10)

name                                              old allocs/op           new allocs/op           delta
BM_BiasAddGradNHWC_32_32_32_128_cpu                  106 ± 0%                 90 ± 0%  -15.09%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_32_32_256_cpu                  170 ± 0%                 90 ± 0%  -47.06%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_32_32_512_cpu                  170 ± 0%                 90 ± 1%  -47.35%         (p=0.000 n=8+10)
BM_BiasAddGradNHWC_32_32_32_1024_cpu                 170 ± 0%                 90 ± 1%  -47.33%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_128_cpu                  107 ± 1%                 90 ± 0%  -15.57%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_64_64_256_cpu                  171 ± 0%                 90 ± 0%  -47.37%          (p=0.001 n=8+9)
BM_BiasAddGradNHWC_32_64_64_512_cpu                  170 ± 0%                 90 ± 1%  -47.33%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_1024_cpu                 170 ± 0%                 90 ± 1%  -47.48%        (p=0.000 n=10+10)

  1. Group all tensorflow includes together.
  2. Delete unnecessary comments.
  3. Use Eigen ops instead.

Change-Id: I67edcd71cb4feeaf8ab0c1820d2c011e3409344d
@ezhulenev (Member):
Based on your PR I've submitted a very similar change in d5f6595; apparently this "reduce all outer dimensions" pattern is pretty common in other gradient kernels (e.g. in FusedBatchNorm, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/fused_batch_norm_op.cc#L215). Can you please move the code for the NCHW case to redux_functors.h and call it in a similar way? I'm not sure, though, what a good name for it would be.

Currently it takes a Tensor as input, but it could be any Eigen expression; I'll work on that after this PR is merged.
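
As a point of reference, here is a minimal plain-Eigen sketch of the "reduce all outer dimensions" pattern being factored out: view the input as (outer, inner) and sum over the outer axis. The names below are illustrative and this is not the actual redux_functor.h interface.

    // Hedged sketch: sum an (outer, inner) view of a flat buffer over its outer axis.
    #include <unsupported/Eigen/CXX11/Tensor>

    template <typename T>
    void ReduceOuterDims(const T* in, T* out, Eigen::Index outer, Eigen::Index inner) {
      // View the flat input as a 2-D row-major tensor of shape (outer, inner).
      Eigen::TensorMap<Eigen::Tensor<const T, 2, Eigen::RowMajor>> input(in, outer, inner);
      Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor>> output(out, inner);
      // Reduce over axis 0, i.e. over all collapsed outer dimensions.
      Eigen::array<Eigen::Index, 1> reduce_axis;
      reduce_axis[0] = 0;
      output = input.sum(reduce_axis);
    }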

@JoursBleu (Contributor):
Hi @ezhulenev,
I have read the code in ReduceOuterDimensions. So you want me to add a functor for situations like
(32, 11, 256, 256) -> (11),
or maybe more generally
(D1, D2, ..., DN) -> (DM) (where M belongs to [1, N]),
and then reference the new functor to do the reduction?

@ezhulenev (Member) commented Mar 12, 2019 via email.

@JoursBleu (Contributor):
@ezhulenev OK.

@JoursBleu (Contributor):
@ezhulenev
Could you please provide the command to run the test in bias_op_test.cc?

@ezhulenev (Member):
bazel run .... -- --benchmarks=all, I think, or something like that, but I'm not 100% sure; there might be some difference between internal and external tooling. Teng Lu (@Zantares) from Intel made it work in #26424 (comment), so I guess he definitely has the answer :)

   reduce in tensorflow/core/kernels/redux_functor.h
2. Opt ReduceOuterDimensions for large inner dim.
3. Rewrite NCHW BiasAddGradOp with ReduceOuterDimensions.

@ezhulenev (Member):
This fails internally with an error:

InvalidArgumentError: var and grad do not have the same shape[1] [3,1]
	 [[node adagrad/slot_11/update_matmul3_b/ApplyAdagrad (defined at {INTERNAL}/training_graph_generator.py:1749) ]]

Errors may have originated from an input operation.
Input Source operations connected to node adagrad/slot_11/update_matmul3_b/ApplyAdagrad:
 gradients/matmul3/BiasAdd_grad/BiasAddGrad (defined at {INTERNAL}//training_graph_generator.py:1306)	
 matmul3_b (defined at {INTERNAL})

I didn't have time to debug it yet.

@JoursBleu (Contributor):
@ezhulenev
OK, how can I reproduce the error? I cannot find the file "training_graph_generator.py" anywhere...

@JoursBleu (Contributor):
@ezhulenev Is there anything I can do?

@ezhulenev (Member):
I'll try to prepare a reproducible test; right now it's part of a large model and it's hard to tell what's going on.

@JoursBleu (Contributor):
@ezhulenev Thanks. Just tell me if there is anything I can do. We need this PR merged ASAP, because several models depend on it.

for (int i = num_dims - num_reduce_dims; i < num_dims; ++i)
inner_dim *= input_dims[i];

if (1 == inner_dim) {

Member:
Isn't this a bug? '32x32x1' should be reduced to a scalar?

Contributor:
Exactly! It should be "outer_dim" here.
Thank you very much for pointing out the bug!
Fixed.

Member:
tensorflow/python/keras:convolutional_test

ValueError: Shapes (3,) and (1, 1, 1, 1, 3) are incompatible

Contributor:
Fixed.

@ezhulenev previously approved these changes Mar 22, 2019
@tensorflow-copybara tensorflow-copybara merged commit 12a5578 into tensorflow:master Mar 25, 2019
tensorflow-copybara pushed a commit that referenced this pull request Mar 25, 2019
@nammbash nammbash deleted the letian/biasaddgrad_opt branch June 16, 2020 19:39